02 - Data from the Web

Deadline

Wednesday October 25, 2017 at 11:59PM

Important Notes

  • Make sure you push on GitHub your Notebook with all the cells already evaluated (i.e., you don't want your colleagues to generate unnecessary Web traffic during the peer review)
  • Don't forget to add a textual description of your thought process, the assumptions you made, and the solution you plan to implement!
  • Please write all your comments in English, and use meaningful variable names in your code.

Background

In this homework we will extract interesting information from www.toptop_universities.com and www.timeshighereducation.com, two platforms that maintain a global ranking of worldwide top_universities. This ranking is not offered as a downloadable dataset, so you will have to find a way to scrape the information we need! You are not allowed to download manually the entire ranking -- rather you have to understand how the server loads it in your browser. For this task, Postman with the Interceptor extension can help you greatly. We recommend that you watch this brief tutorial to understand quickly how to use it.

Assignment

  1. Obtain the 200 top-ranking top_universities in www.toptop_universities.com (ranking 2018). In particular, extract the following fields for each university: name, rank, country and region, number of faculty members (international and total) and number of students (international and total). Some information is not available in the main list and you have to find them in the details page. Store the resulting dataset in a pandas DataFrame and answer the following questions:
  • Which are the best top_universities in term of: (a) ratio between faculty members and students, (b) ratio of international students?
  • Answer the previous question aggregating the data by (c) country and (d) region.

Plot your data using bar charts and describe briefly what you observed.

  1. Obtain the 200 top-ranking top_universities in www.timeshighereducation.com (ranking 2018). Repeat the analysis of the previous point and discuss briefly what you observed.

  2. Merge the two DataFrames created in questions 1 and 2 using university names. Match top_universities' names as well as you can, and explain your strategy. Keep track of the original position in both rankings.

  3. Find useful insights in the data by performing an exploratory analysis. Can you find a strong correlation between any pair of variables in the dataset you just created? Example: when a university is strong in its international dimension, can you observe a consistency both for students and faculty members?

  4. Can you find the best university taking in consideration both rankings? Explain your approach.

Hints:

  • Keep your Notebook clean and don't print the verbose output of the requests if this does not add useful information for the reader.
  • In case of tie, use the order defined in the webpage.

In [1]:
import requests, re, html
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup

from tqdm import tqdm_notebook
import warnings
warnings.filterwarnings('ignore')

NUM_OBS = 200

www.topuniversities.com


In [2]:
root_url_1 = 'https://www.topuniversities.com'
# we use the link to the API from where the website fetches its data instead of BeautifulSoup
# much much cleaner
list_url_1 = root_url_1 + '/sites/default/files/qs-rankings-data/357051_indicators.txt'

r = requests.get(list_url_1)
top_universities = pd.DataFrame()
top_universities = top_universities.from_dict(r.json()['data'])[['uni', 'overall_rank', 'location', 'region']]
# get the university name and details URL with a regex
top_universities['name'] = top_universities['uni'].apply(lambda name: html.unescape(re.findall('<a[^>]+href=\"(.*?)\"[^>]*>(.*)?</a>', name)[0][1]))
top_universities['url'] = top_universities['uni'].apply(lambda name: html.unescape(re.findall('<a[^>]+href=\"(.*?)\"[^>]*>(.*)?</a>', name)[0][0]))
top_universities.drop('uni', axis=1, inplace=True)
top_universities['overall_rank'] = top_universities['overall_rank'].astype(int)

In [3]:
# selects the top N rows based on the colum_name of the dataframe df 
def select_top_N(df, column_name, N):
    df = df.sort_values(by=column_name)
    df = df[df[column_name] <= N]
    return df

In [57]:
# get only the first top-200 universities by overall rank
top_universities = select_top_N(top_universities, 'overall_rank', NUM_OBS)
top_universities.head()


Out[57]:
overall_rank location region name url students_total students_international students_national faculty_total faculty_international faculty_national
0 1 United States North America Massachusetts Institute of Technology (MIT) /universities/massachusetts-institute-technolo... 11067.0 3717.0 7350.0 2982.0 1679.0 1303.0
1 2 United States North America Stanford University /universities/stanford-university 15878.0 3611.0 12267.0 4285.0 2042.0 2243.0
2 3 United States North America Harvard University /universities/harvard-university 22429.0 5266.0 17163.0 4350.0 1311.0 3039.0
3 4 United States North America California Institute of Technology (Caltech) /universities/california-institute-technology-... 2255.0 647.0 1608.0 953.0 350.0 603.0
4 5 United Kingdom Europe University of Cambridge /universities/university-cambridge 18770.0 6699.0 12071.0 5490.0 2278.0 3212.0

In [58]:
students_total = []
students_inter = []
faculty_total = []
faculty_inter = []

def get_num(soup, selector):
    scraped = soup.select(selector)
    # Some top_universities don't have stats, return NaN for these case
    if scraped:
        return int(scraped[0].contents[0].replace(',', ''))
    else:
        return np.NaN


for details_url in tqdm_notebook(top_universities['url']):
    soup = BeautifulSoup(requests.get(root_url_1 + details_url).text, 'html.parser')
    
    students_total.append(get_num(soup, 'div.total.student div.number'))
    students_inter.append(get_num(soup, 'div.total.inter div.number'))
    faculty_total.append(get_num(soup, 'div.total.faculty div.number'))
    faculty_inter.append(get_num(soup, 'div.inter.faculty div.number'))


top_universities['students_total'] = students_total
top_universities['students_international'] = students_inter
top_universities['students_national'] = top_universities['students_total'] - top_universities['students_international']
top_universities['faculty_total'] = faculty_total
top_universities['faculty_international'] = faculty_inter
top_universities['faculty_national'] = top_universities['faculty_total'] - top_universities['faculty_international']

top_universities.head()



Out[58]:
overall_rank location region name url students_total students_international students_national faculty_total faculty_international faculty_national
0 1 United States North America Massachusetts Institute of Technology (MIT) /universities/massachusetts-institute-technolo... 11067.0 3717.0 7350.0 2982.0 1679.0 1303.0
1 2 United States North America Stanford University /universities/stanford-university 15878.0 3611.0 12267.0 4285.0 2042.0 2243.0
2 3 United States North America Harvard University /universities/harvard-university 22429.0 5266.0 17163.0 4350.0 1311.0 3039.0
3 4 United States North America California Institute of Technology (Caltech) /universities/california-institute-technology-... 2255.0 647.0 1608.0 953.0 350.0 603.0
4 5 United Kingdom Europe University of Cambridge /universities/university-cambridge 18770.0 6699.0 12071.0 5490.0 2278.0 3212.0

In [6]:
#defining colors for each type of plot
colors_1 = ['#FF9F9A', '#D0BBFF']
colors_2 = ['#92C6FF', '#97F0AA']
plt.style.use('ggplot')

Best universities in term of:

We selected the top 10 universities in point (a) and (b). For point (c) and (d), the top 200 universities were used in order to have more data.

(a) ratio between faculty members and students


In [63]:
top = 10
top_universities_ratio = select_top_N(top_universities, 'overall_rank', top)

In [64]:
top_universities_ratio_sf = top_universities_ratio[['name', 'students_total', 'faculty_total']]
top_universities_ratio_sf = top_universities_ratio_sf.set_index(['name'])
top_universities_ratio_sf.index.name = None

fig, axes = plt.subplots(1, 1, figsize=(10,5), sharey=True)
top_universities_ratio_sf.plot.bar(stacked=True, color=colors_1, ax=axes)
axes.set_title('Top 10 ratio\'s between students and faculty members among universities')
axes.legend(labels=['students', 'faculty members'], bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
fig.autofmt_xdate()
plt.show()


Comments: We see that it is rather difficult to compare the ratios of the different universities. This is due to the different sizes of the population. In order to draw more precise information about it, we need to normalize the data with repect to each university.


In [65]:
# normalize the data to be able to make a good comparison
top_universities_ratio_normed = top_universities_ratio_sf.div(top_universities_ratio_sf.sum(1), axis=0).sort_values(by='faculty_total', ascending=False)
top_universities_ratio_normed.index.name = None

fig, axes = plt.subplots(1, 1, figsize=(10,5), sharey=True)
top_universities_ratio_normed.plot.bar(stacked=True, color=colors_1, ax=axes)
axes.set_title('Top 10 ratio\'s between students and faculty members among universities')
axes.legend(labels=['students', 'faculty members'], bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
# we can restrict the range on the y axis to avoid displaying unnecessary content
axes.set_ylim([0.7,1])
fig.autofmt_xdate()
plt.show()


Comments: You noticed that the y-axis ranges from 0.7 to 1. We limited the visualization to this interval because the complete interval does not add meaningful insight about the data. Analyzing the results, we see that the Caltech university is the university in the top 10 offering more faculty members to its students. ETHZ is in the last position.

(b) ratio of international students


In [66]:
top_universities_ratio_s = top_universities_ratio[['name', 'students_international', 'students_national']]
top_universities_ratio_s = top_universities_ratio_s.set_index(['name'])
top_universities_ratio_s_normed = top_universities_ratio_s.div(top_universities_ratio_s.sum(1), axis=0).sort_values(by='students_international', ascending=False)
top_universities_ratio_s_normed.index.name = None

fig, axes = plt.subplots(1, 1, figsize=(10, 5))
top_universities_ratio_s_normed.plot.bar(stacked=True, color=colors_2, ax=axes)
axes.set_title('Top 10 ratio\'s of international and national students among universities')
axes.legend(labels=['international students', 'national students'], bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
axes.set_ylim([0, 0.6])
fig.autofmt_xdate()
plt.show()


Comments: The most international university, by its students, among the top 10 universities is the Imperial College London. Notice that ETHZ is in the third position.

(c) same comparisons by country


In [67]:
ratio_country_sf = top_universities.groupby(['location'])['students_total', 'faculty_total'].sum()
ratio_country_sf_normed = ratio_country_sf.div(ratio_country_sf.sum(1), axis=0).sort_values(by='faculty_total', ascending=False)
ratio_country_sf_normed.index.name = None

fig, axes = plt.subplots(1, 1, figsize=(15, 5))
ratio_country_sf_normed.plot.bar(stacked=True, color=colors_1, ax=axes)
axes.set_title('Ratio of students and faculty members by country')
axes.legend(labels=['students', 'faculty members'], bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
axes.set_ylim([0.8,1])
fig.autofmt_xdate()
plt.show()



In [68]:
ratio_country_s = top_universities.groupby(['location'])['students_international', 'students_national'].sum()
ratio_country_s_normed = ratio_country_s.div(ratio_country_s.sum(1), axis=0).sort_values(by='students_international', ascending=False)
ratio_country_s_normed.index.name = None

fig, axes = plt.subplots(1, 1, figsize=(15, 5))
ratio_country_s_normed.plot.bar(stacked=True, color=colors_2, ax=axes)
axes.set_title('Ratio of international and national students by country')
axes.legend(labels=['international students', 'national students'], bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
axes.set_ylim([0, 0.4])
fig.autofmt_xdate()
plt.show()


Comments: Aggregating the data by country, we see that Russia is the country offering more faculty members for its student, followed by Danemark and Saudi Arabia. The most international university in terms of students is Australia, followed by United Kingdom and Hong Kong. Switzerland is in the fifth position and India is the country with the lowest ratio of international students.

(d) same comparisons by region


In [69]:
ratio_region_s = top_universities.groupby(['region'])['students_total', 'faculty_total'].sum()
ratio_region_s_normed = ratio_region_s.div(ratio_region_s.sum(1), axis=0).sort_values(by='faculty_total', ascending=False)
ratio_region_s_normed.index.name = None

fig, axes = plt.subplots(1, 1, figsize=(10,5), sharey=True)
ratio_region_s_normed.plot.bar(stacked=True, color=colors_1, ax=axes)
axes.set_title('Ratio of students and faculty members by region')
axes.legend(labels=['students', 'faculty members'], bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
axes.set_ylim([0.8,1])
axes.yaxis.grid(True)
fig.autofmt_xdate()
plt.show()



In [70]:
ratio_region_s = top_universities.groupby(['region'])['students_international', 'students_national'].sum()
ratio_region_s_normed = ratio_region_s.div(ratio_region_s.sum(1), axis=0).sort_values(by='students_international', ascending=False)
ratio_region_s_normed.index.name = None

fig, axes = plt.subplots(1, 1, figsize=(10,5), sharey=True)
ratio_region_s_normed.plot.bar(stacked=True, color=colors_2, ax=axes)
axes.set_title('Ratio of international and national students by region')
axes.legend(labels=['international students', 'national students'], bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
axes.set_ylim([0,0.4])
axes.yaxis.grid(True)
fig.autofmt_xdate()
plt.show()


Comments: Asia is the region offering more faculty members to its students. It is followed by North America and Europe. The most international university in terms of students is Oceania. Europe is second.

Analysis of the two methods

We get consistent results comparing the results obtained by region or by country about the ratio of international students. By country, we get Australia and by region, Oceania. This makes sense as Australia owns nine of the eleven top_universities of Oceania.

www.timeshighereducation.com


In [71]:
# we repeat the same procedure as for www.topuniversities.com
root_url_2 = 'https://www.timeshighereducation.com'
list_url_2 = root_url_2 + '/sites/default/files/the_data_rankings/world_university_rankings_2018_limit0_369a9045a203e176392b9fb8f8c1cb2a.json'

r = requests.get(list_url_2)
times_higher_education = pd.DataFrame()
times_higher_education = times_higher_education.from_dict(r.json()['data'])[['rank', 'location', 'location', 'name', 'url', 'stats_number_students', 'stats_pc_intl_students', 'stats_student_staff_ratio']]

# rename columns as is the first dataframe
times_higher_education.columns = ['overall_rank', 'location', 'region', 'name', 'url', 'students_total', 'ratio_inter', 'student_staff_ratio']


# as the ranks have different represetation we had to delete the '=' in front of universities that have the same rank,
# rewrite the rank when it is represented as an interval (ex: 201-250) and finally delete the '+' in the end for the last ones 
times_higher_education['overall_rank'] = times_higher_education['overall_rank'].apply(lambda rank: re.sub('[=]', '', rank))
times_higher_education['overall_rank'] = times_higher_education['overall_rank'].apply(lambda rank: rank.split('–')[0])
times_higher_education['overall_rank'] = times_higher_education['overall_rank'].apply(lambda rank: re.sub('[+]', '', rank)).astype(int)

# remaps a ranking in order to make selection by ranking easier
# ex: 1,2,3,3,5,6,7 -> 1,2,3,3,4,5,6
def remap_ranking(rank):
    last=0
    for i in range(len(rank)):
        if last == rank[i]-1:
            #no problem
            last = rank[i]
        elif last != rank[i]:
            last = last+1
            rank[i] = last
            rank[(i+1):] = rank[(i+1):]-1
    return rank

times_higher_education['overall_rank'] = remap_ranking(times_higher_education['overall_rank'].copy())


# in the following lines we make the necessary transformation in order to get the right type or numbers for each column
times_higher_education['students_total'] = times_higher_education['students_total'].apply(lambda x: re.sub('[^0-9]','', x)).astype(int) 
times_higher_education['ratio_inter'] = times_higher_education['ratio_inter'].apply(lambda x: re.sub('[^0-9]','', x)).astype(float) 
times_higher_education['student_staff_ratio'] = times_higher_education['student_staff_ratio'].astype(float) 

times_higher_education['students_international'] = (times_higher_education['students_total'] * (times_higher_education['ratio_inter']/100)).astype(int)
times_higher_education['students_national'] = times_higher_education['students_total'] - times_higher_education['students_international']

times_higher_education['faculty_total'] = (times_higher_education['students_total'] / times_higher_education['student_staff_ratio']).astype(int)
times_higher_education['faculty_international'] = np.NaN
times_higher_education['faculty_national'] = np.NaN 

times_higher_education['region'] = np.NaN

# resolve ties
times_higher_education['overall_rank'] = np.arange(1, times_higher_education.shape[0]+1)


# resolve N/A region
loc_to_reg = top_universities[['location', 'region']]
loc_to_reg = set(loc_to_reg.apply(lambda x: '{}_{}'.format(x['location'], x['region']), axis=1).values)
loc_to_reg = {x.split('_')[0]: x.split('_')[1] for x in loc_to_reg}
from collections import defaultdict
loc_to_reg = defaultdict(lambda: 'N/A', loc_to_reg)
def resolve_uni(x):
    x['region'] = loc_to_reg[x['location']]
    return x

times_higher_education = times_higher_education.apply(resolve_uni, axis=1)

del times_higher_education['ratio_inter']
del times_higher_education['student_staff_ratio']

In [72]:
# get only the first top-200 universities by overall rank
times_higher_education = select_top_N(times_higher_education, 'overall_rank', NUM_OBS)
times_higher_education.head()


Out[72]:
overall_rank location region name url students_total students_international students_national faculty_total faculty_international faculty_national
0 1 United Kingdom Europe University of Oxford /world-university-rankings/university-oxford 20409 7755 12654 1822 NaN NaN
1 2 United Kingdom Europe University of Cambridge /world-university-rankings/university-cambridge 18389 6436 11953 1687 NaN NaN
2 3 United States North America California Institute of Technology /world-university-rankings/california-institut... 2209 596 1613 339 NaN NaN
3 4 United States North America Stanford University /world-university-rankings/stanford-university 15845 3485 12360 2112 NaN NaN
4 5 United States North America Massachusetts Institute of Technology /world-university-rankings/massachusetts-insti... 11177 3800 7377 1284 NaN NaN

Best universities in term of:

We selected the top 10 universities in point (a) and (b). For point (c) and (d), the top 200 universities were used in order to have more data.

(a) ratio between faculty members and students


In [73]:
top = 10
times_higher_education_ratio = select_top_N(times_higher_education, 'overall_rank', top)

In [87]:
times_higher_education_ratio_sf = times_higher_education_ratio[['name', 'students_total', 'faculty_total']]
times_higher_education_ratio_sf = times_higher_education_ratio_sf.set_index(['name'])
times_higher_education_ratio_normed = times_higher_education_ratio_sf.div(times_higher_education_ratio_sf.sum(1), axis=0).sort_values(by='faculty_total', ascending=False)
times_higher_education_ratio_normed.index.name = None

fig, axes = plt.subplots(1, 1, figsize=(10,5), sharey=True)
times_higher_education_ratio_normed.plot.bar(stacked=True, color=colors_1, ax=axes)
axes.set_title('Top 10 ratio\'s between students and faculty members among universities')
axes.legend(labels=['students', 'faculty members'], bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
axes.set_ylim([0.8,1])
fig.autofmt_xdate()
plt.show()


Comments: The university of Chicago is the faculty with the more faculty members by students. It is closely followed by the California Institute of Technology.

(b) ratio of international students


In [75]:
times_higher_education_ratio_s = times_higher_education_ratio[['name', 'students_international', 'students_national']]
times_higher_education_ratio_s = times_higher_education_ratio_s.set_index(['name'])
times_higher_education_ratio_s_normed = times_higher_education_ratio_s.div(times_higher_education_ratio_s.sum(1), axis=0).sort_values(by='students_international', ascending=False)
times_higher_education_ratio_s_normed.index.name = None

fig, axes = plt.subplots(1, 1, figsize=(10, 5))
times_higher_education_ratio_s_normed.plot.bar(stacked=True, color=colors_2, ax=axes)
axes.set_title('Top 10 ratio\'s of international and national students among universities')
axes.legend(labels=['international students', 'national students'], bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
axes.set_ylim([0.2, 0.6])
fig.autofmt_xdate()
plt.show()


Comments: The Imperial College Longon university has a strong lead in the internationalization of its student. Oxford and ETHZ are following bunched together.

(c) same comparisons by country


In [76]:
ratio_country_sf = times_higher_education.groupby(['location'])['students_total', 'faculty_total'].sum()
ratio_country_sf_normed = ratio_country_sf.div(ratio_country_sf.sum(1), axis=0).sort_values(by='faculty_total', ascending=False)
ratio_country_sf_normed.index.name = None

fig, axes = plt.subplots(1, 1, figsize=(15, 5))
ratio_country_sf_normed.plot.bar(stacked=True, color=colors_1, ax=axes)
axes.set_title('Ratio of students and faculty members by country')
axes.legend(labels=['students', 'faculty members'], bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
axes.set_ylim([0.8,1])
fig.autofmt_xdate()
plt.show()


Comments: Denmark is in the first position. We find the Russian Federation in the second place. This is the same result obtained with the top universities website the other way around. This shows that either the universities of each country have different ranking in each website or each website has different information about each university.


In [77]:
ratio_country_s = times_higher_education.groupby(['location'])['students_international', 'students_national'].sum()
ratio_country_s_normed = ratio_country_s.div(ratio_country_s.sum(1), axis=0).sort_values(by='students_international', ascending=False)
ratio_country_s_normed.index.name = None

fig, axes = plt.subplots(1, 1, figsize=(15, 5))
ratio_country_s_normed.plot.bar(stacked=True, color=colors_2, ax=axes)
axes.set_title('Ratio of international and national students by country')
axes.legend(labels=['international students', 'national students'], bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
axes.set_ylim([0, 0.6])
fig.autofmt_xdate()
plt.show()


Comments: Luxembourg has more international than national students which allows it to be in first position without difficulty. Switzerland is in the sixth position (versus fifth for top university website).

(d) same comparisons by region


In [78]:
# Some countries have their field 'region' filled with 'N/A': this is due to the technique we used to write the
# correct region for each university. In the sample we are considering, let's see how many universities are concerned:
times_higher_education[times_higher_education['region'] == 'N/A']


Out[78]:
overall_rank location region name url students_total students_international students_national faculty_total faculty_international faculty_national
178 179 Luxembourg N/A University of Luxembourg /world-university-rankings/university-luxembourg 4969 2832 2137 340 NaN NaN
193 194 Russian Federation N/A Lomonosov Moscow State University /world-university-rankings/lomonosov-moscow-st... 29236 6431 22805 4004 NaN NaN

In [85]:
# As there is only two universities concerned, we can rapidly write it by hand. Of course we should have develop a
# more generalized manner to do it, if we had a much larger sample.
times_higher_education.set_value(178, 'region', 'Europe')
times_higher_education.set_value(193, 'region', 'Europe')


Out[85]:
overall_rank location region name url students_total students_international students_national faculty_total faculty_international faculty_national
0 1 United Kingdom Europe University of Oxford /world-university-rankings/university-oxford 20409 7755 12654 1822 NaN NaN
1 2 United Kingdom Europe University of Cambridge /world-university-rankings/university-cambridge 18389 6436 11953 1687 NaN NaN
2 3 United States North America California Institute of Technology (Caltech) /world-university-rankings/california-institut... 2209 596 1613 339 NaN NaN
3 4 United States North America Stanford University /world-university-rankings/stanford-university 15845 3485 12360 2112 NaN NaN
4 5 United States North America Massachusetts Institute of Technology (MIT) /world-university-rankings/massachusetts-insti... 11177 3800 7377 1284 NaN NaN
5 6 United States North America Harvard University /world-university-rankings/harvard-university 20326 5284 15042 2283 NaN NaN
6 7 United States North America Princeton University /world-university-rankings/princeton-university 7955 1909 6046 958 NaN NaN
7 8 United Kingdom Europe Imperial College London /world-university-rankings/imperial-college-lo... 15857 8721 7136 1390 NaN NaN
8 9 United States North America University of Chicago /world-university-rankings/university-chicago 13525 3381 10144 2181 NaN NaN
9 10 Switzerland Europe ETH Zurich - Swiss Federal Institute of Techno... /world-university-rankings/eth-zurich-swiss-fe... 19233 7308 11925 1317 NaN NaN
10 11 United States North America University of Pennsylvania /world-university-rankings/university-pennsylv... 20361 4072 16289 3132 NaN NaN
11 12 United States North America Yale University /world-university-rankings/yale-university 12155 2552 9603 2826 NaN NaN
12 13 United States North America Johns Hopkins University /world-university-rankings/johns-hopkins-unive... 15498 3719 11779 3604 NaN NaN
13 14 United States North America Columbia University /world-university-rankings/columbia-university 26587 8507 18080 4358 NaN NaN
14 15 United States North America University of California, Los Angeles (UCLA) /world-university-rankings/university-californ... 39279 6677 32602 4091 NaN NaN
15 16 United Kingdom Europe University College London /world-university-rankings/university-college-... 30304 14848 15456 2886 NaN NaN
16 17 United States North America Duke University /world-university-rankings/duke-university 15256 3356 11900 3390 NaN NaN
17 18 United States North America University of California, Berkeley (UCB) /world-university-rankings/university-californ... 36182 6150 30032 2761 NaN NaN
18 19 United States North America Cornell University /world-university-rankings/cornell-university 21850 5244 16606 2229 NaN NaN
19 20 United States North America Northwestern University /world-university-rankings/northwestern-univer... 17466 3143 14323 1364 NaN NaN
20 21 United States North America University of Michigan /world-university-rankings/university-michigan 41818 6690 35128 4862 NaN NaN
21 22 Singapore Asia National University of Singapore (NUS) /world-university-rankings/national-university... 30602 9180 21422 1800 NaN NaN
22 23 Canada North America University of Toronto /world-university-rankings/university-toronto 69427 11802 57625 3712 NaN NaN
23 24 United States North America Carnegie Mellon University /world-university-rankings/carnegie-mellon-uni... 12676 5704 6972 938 NaN NaN
24 25 United Kingdom Europe London School of Economics and Political Scien... /world-university-rankings/london-school-econo... 10065 7146 2919 825 NaN NaN
25 26 United States North America University of Washington /world-university-rankings/university-washington 44945 7191 37754 3942 NaN NaN
26 27 United Kingdom Europe The University of Edinburgh /world-university-rankings/university-edinburgh 26759 10703 16056 2140 NaN NaN
27 28 United States North America New York University (NYU) /world-university-rankings/new-york-university 43860 11403 32457 4568 NaN NaN
28 29 China Asia Peking University /world-university-rankings/peking-university 42136 6741 35395 4734 NaN NaN
29 30 China Asia Tsinghua University /world-university-rankings/tsinghua-university 42089 3788 38301 3072 NaN NaN
... ... ... ... ... ... ... ... ... ... ... ...
170 171 South Africa Africa University of Cape Town /world-university-rankings/university-cape-town 20775 3739 17036 1775 NaN NaN
171 172 United States North America Rutgers, the State University of New Jersey /world-university-rankings/rutgers-state-unive... 43439 4343 39096 3913 NaN NaN
172 173 Sweden Europe KTH Royal Institute of Technology /world-university-rankings/kth-royal-institute... 12951 2201 10750 809 NaN NaN
173 174 Germany Europe University of Münster /world-university-rankings/university-munster 42287 2960 39327 992 NaN NaN
174 175 Belgium Europe Université Libre de Bruxelles /world-university-rankings/universite-libre-de... 25343 8870 16473 583 NaN NaN
175 176 United Kingdom Europe Newcastle University /world-university-rankings/newcastle-university 20952 6285 14667 1343 NaN NaN
176 177 United Kingdom Europe University of Liverpool /world-university-rankings/university-liverpool 19815 6935 12880 1693 NaN NaN
177 178 China Asia Zhejiang University /world-university-rankings/zhejiang-university 50051 3003 47048 3549 NaN NaN
178 179 Luxembourg Europe University of Luxembourg /world-university-rankings/university-luxembourg 4969 2832 2137 340 NaN NaN
179 180 Netherlands Europe University of Twente /world-university-rankings/university-twente 6697 1808 4889 523 NaN NaN
180 181 France Europe Paris-Sud University /world-university-rankings/paris-sud-university 27221 4355 22866 1877 NaN NaN
181 182 France Europe École Normale Supérieure de Lyon /world-university-rankings/ecole-normale-super... 2221 288 1933 261 NaN NaN
182 183 Hong Kong Asia The Hong Kong Polytechnic University /world-university-rankings/hong-kong-polytechn... 21773 5443 16330 809 NaN NaN
183 184 Italy Europe Scuola Normale Superiore di Pisa /world-university-rankings/scuola-normale-supe... 515 36 479 99 NaN NaN
184 185 United Kingdom Europe University of Aberdeen /world-university-rankings/university-aberdeen 12055 4339 7716 867 NaN NaN
185 186 United States North America University of Miami /world-university-rankings/university-miami 15860 2537 13323 2688 NaN NaN
186 187 United Kingdom Europe University of Dundee /world-university-rankings/university-dundee 10893 2614 8279 731 NaN NaN
187 188 United Kingdom Europe University of East Anglia /world-university-rankings/university-east-anglia 13521 4056 9465 994 NaN NaN
188 189 China Asia Shanghai Jiao Tong University /world-university-rankings/shanghai-jiao-tong-... 37288 2237 35051 3007 NaN NaN
189 190 Finland Europe Aalto University /world-university-rankings/aalto-university 12744 2548 10196 630 NaN NaN
190 191 United States North America University of Massachusetts /world-university-rankings/university-massachu... 59230 8292 50938 4453 NaN NaN
191 192 New Zealand Oceania The University of Auckland /world-university-rankings/university-auckland 30348 8800 21548 1614 NaN NaN
192 193 United States North America Northeastern University /world-university-rankings/northeastern-univer... 18780 6009 12771 1341 NaN NaN
193 194 Russian Federation Europe Lomonosov Moscow State University /world-university-rankings/lomonosov-moscow-st... 29236 6431 22805 4004 NaN NaN
194 195 Netherlands Europe Tilburg University /world-university-rankings/tilburg-university 7738 1005 6733 343 NaN NaN
195 196 France Europe Paris-Sorbonne University – Paris 4 /world-university-rankings/paris-sorbonne-univ... 22042 3306 18736 729 NaN NaN
196 197 United Kingdom Europe Royal Holloway, University of London /world-university-rankings/royal-holloway-univ... 8931 3572 5359 579 NaN NaN
197 198 United States North America University of California, Davis /world-university-rankings/university-californ... 20740 2903 17837 1001 NaN NaN
198 199 Sweden Europe University of Gothenburg /world-university-rankings/university-gothenburg 18378 2389 15989 1837 NaN NaN
199 200 Taiwan Asia National Taiwan University (NTU) /world-university-rankings/national-taiwan-uni... 31758 2540 29218 2761 NaN NaN

200 rows × 11 columns


In [80]:
ratio_region_s = times_higher_education.groupby(['region'])['students_total', 'faculty_total'].sum()
ratio_region_s_normed = ratio_region_s.div(ratio_region_s.sum(1), axis=0).sort_values(by='faculty_total', ascending=False)
ratio_region_s_normed.index.name = None

fig, axes = plt.subplots(1, 1, figsize=(10,5), sharey=True)
ratio_region_s_normed.plot.bar(stacked=True, color=colors_1, ax=axes)
axes.set_title('Ratio of students and faculty members by region')
axes.legend(labels=['students', 'faculty members'], bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
axes.set_ylim([0.8,1])
axes.yaxis.grid(True)
fig.autofmt_xdate()
plt.show()



In [81]:
ratio_region_s = times_higher_education.groupby(['region'])['students_international', 'students_national'].sum()
ratio_region_s_normed = ratio_region_s.div(ratio_region_s.sum(1), axis=0).sort_values(by='students_international', ascending=False)
ratio_region_s_normed.index.name = None

fig, axes = plt.subplots(1, 1, figsize=(10,5), sharey=True)
ratio_region_s_normed.plot.bar(stacked=True, color=colors_2, ax=axes)
axes.set_title('Ratio of international and national students by region')
axes.legend(labels=['international students', 'national students'], bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
axes.set_ylim([0,0.4])
axes.yaxis.grid(True)
fig.autofmt_xdate()
plt.show()


Comments: In the first plot, we see that Africa is the region where there is more faculty members by students. The two following regions are very close to each other. In the second plot, Oceania is the more internationalized school in terms of its students and Europe is second. We had similar results by the other website concerning this last outcome.


In [82]:
# Detects same universities with different names in the two dataframes before merging
# using Jaccard similarity and same location rule (seems to keep matching entry)
def t(x):
    # Compute Jaccard score (intersection over union)
    def jaccard(a, b):
        u = set(a.split(' '))
        v = set(b.split(' '))
        return len(u.intersection(v)) / len(u.union(v))
        
    names = top_universities['name'].tolist()
    locations = top_universities['location'].tolist()
    scores = np.array([jaccard(x['name'], n) for n in names])
    m = scores.max()
    i = scores.argmax()
    # Jaccard score for similarity and location match to filter out name with different locations
    if m > 0.5 and x['location'] == locations[i]:
        x['name'] = names[i]
    return x
    
# Match universities name in both dataframes  
times_higher_education = times_higher_education.apply(t, axis=1)
# Intersection on the name column of the two datasets 
merged = pd.merge(top_universities, times_higher_education, on='name', how='inner')

merged.head()


Out[82]:
overall_rank_x location_x region_x name url_x students_total_x students_international_x students_national_x faculty_total_x faculty_international_x ... overall_rank_y location_y region_y url_y students_total_y students_international_y students_national_y faculty_total_y faculty_international_y faculty_national_y
0 1 United States North America Massachusetts Institute of Technology (MIT) /universities/massachusetts-institute-technolo... 11067.0 3717.0 7350.0 2982.0 1679.0 ... 5 United States North America /world-university-rankings/massachusetts-insti... 11177 3800 7377 1284 NaN NaN
1 2 United States North America Stanford University /universities/stanford-university 15878.0 3611.0 12267.0 4285.0 2042.0 ... 4 United States North America /world-university-rankings/stanford-university 15845 3485 12360 2112 NaN NaN
2 3 United States North America Harvard University /universities/harvard-university 22429.0 5266.0 17163.0 4350.0 1311.0 ... 6 United States North America /world-university-rankings/harvard-university 20326 5284 15042 2283 NaN NaN
3 4 United States North America California Institute of Technology (Caltech) /universities/california-institute-technology-... 2255.0 647.0 1608.0 953.0 350.0 ... 3 United States North America /world-university-rankings/california-institut... 2209 596 1613 339 NaN NaN
4 5 United Kingdom Europe University of Cambridge /universities/university-cambridge 18770.0 6699.0 12071.0 5490.0 2278.0 ... 2 United Kingdom Europe /world-university-rankings/university-cambridge 18389 6436 11953 1687 NaN NaN

5 rows × 21 columns

Insights

Here we first proceed by creating the correlation matrix (since it's a symetric matrix we only kept the lower triangle). We then plot it using a heatmap to see correlation between columns of the dataframe. We also made another heatmap with only correlation whose absolute value is greater than 0.5. Finally we averaged the features when they were available for the two websites (except rankings).

Correlations analysis

Some correlations bring interesting information:

  • $Corr(overall\_rank\_x, overall\_rank\_y) = 0.7$
    We get a strong correlation between the ranking of the first website and the second one. It shows us that the two website ranking methods lead on similar results (since the correlation is positive). It's insightfull since even if the features are approximately the same for the two websites, their methodology to attribute a rank could be really different. This important positive correlation reveals that the methodologies doesn't differ so much between the two websites.
  • $Corr(students\_international\_avg, faculty\_international\_avg) = 0.59$
    Here we have an interesting correlation between the number of international students and the number of international staff members.
  • We have strong correlation between same features but coming from different websites. It's not really interesting since difference in same features from the two websites are likely to be small. Also we have important correlation between "total" features and their sub-categories like "international" and "national". These are not interesting too because they follow a simple relation: when the total is higher, the sub-categories are likely to be higher numbers too (i.e. if we have more students, we are likely to have more national or international students).

In [83]:
merged_num = merged.select_dtypes(include=[np.number])
merged_num.dropna(how='all', axis=1)
merged_num.dropna(how='any', axis=0)

def avg_feature(x):
    cols = set([x for x in x.index if 'overall' not in x])
    cols_common = set([x[0:-2] for x in cols])
    for cc in cols_common:
        cc_x = '{}_x'.format(cc)
        cc_y = '{}_y'.format(cc)
        if cc_y in cols:
            x['{}_avg'.format(cc)] = (x[cc_x] + x[cc_y]) / 2
        else:
            x['{}_avg'.format(cc)] = x[cc_x] / 2
    for c in cols:
        del x[c]
    return x
    
merged_num_avg = merged_num.apply(avg_feature, axis=1)

merged_num.head()
corr = merged_num.corr()
mask = np.zeros_like(corr, dtype=np.bool)
mask[np.triu_indices_from(mask)] = True

fig, ax = plt.subplots(figsize=(10,10))
sns.heatmap(corr, ax=ax, mask=mask, annot=True, square=True)
plt.show()

# Keep only correlation with and absolute value superior to 0.5 
corr[(corr < 0.5) & (corr > -0.5)] = 0
fig, ax = plt.subplots(figsize=(10,10))
sns.heatmap(corr, ax=ax, mask=mask, annot=True, square=True)
plt.show()

# Keep only correlation with and absolute value superior to 0.5 for averaged features
corr = merged_num_avg.corr()
mask = np.zeros_like(corr, dtype=np.bool)
mask[np.triu_indices_from(mask)] = True
corr[(corr < 0.5) & (corr > -0.5)] = 0
fig, ax = plt.subplots(figsize=(10,10))
sns.heatmap(corr, ax=ax, mask=mask, annot=True, square=True)
plt.show()


Best university

First we have to transform ranking in some score. Here we assume a linear relation for the score given the ranking, so we gave a score of 1 for the best ranking and 0 for the worst ranking with linear mapping between these two. We did it for each of the ranking (the two websites). Also we don't really know if a website is most trustworthy than the other, so a good merging for the ranking would be to take the average of the two scores with equal weights for each score. Finally we also took into account the ratio of staff member per students:

$finalScore = mean(score1, score2, staff per studiants)$

After computing these values, we found that Caltech is the best university (according to our assumptions).

Per Website ranking:

Caltech: top_universities -> 4 | times_higher_education = -> 3 | staff per student ratio -> 0.15 | => final score: 0.71


In [84]:
r = merged[['name', 'overall_rank_x', 'overall_rank_y']]
r.head()

def lin(df):
    best_rank = df.min() 
    worst_rank = df.max() 
    a = 1 / (best_rank - worst_rank)
    b = 1 - a*best_rank
    return df.apply(lambda x: a*x + b)

r['stud_staff_ratio'] = merged[['faculty_international_x', 'faculty_international_y']].mean(axis=1) / \
                        merged[['students_total_x', 'students_total_y']].mean(axis=1)
r['score_x'] = lin(r['overall_rank_x'])
r['score_y'] = lin(r['overall_rank_y'])
r['overall_score'] = r[['score_x', 'score_y', 'stud_staff_ratio']].mean(axis=1) 
r = r.dropna()

r[r['overall_score'] == r['overall_score'].max()]


Out[84]:
name overall_rank_x overall_rank_y stud_staff_ratio score_x score_y overall_score
3 California Institute of Technology (Caltech) 4 3 0.15681 0.984925 0.98995 0.710561